Abstract

Word segmentation is the process of finding the most
likely sequence of words from a sequence of concatenated
characters without spaces. Several researchers have
proposed solutions to word segmentation using heuristic
methods. The main task of these methods is to find the
best segmentation without searching the entire state
space. This paper proposes a new approach for word
segmentation based on parameter optimization by means of
a Genetic Algorithm. The approach is tested on the
English language using two different language models and
several test sets. To show that the presented approach
is language independent, it is also evaluated on the
Arabic language. The experiments show that segmentation
using parameter optimization gives better results.

Keywords: NLP, Word segmentation, GA 

Nomenclature 

NLP Natural Language Processing
GA Genetic Algorithm
BNC British National Corpus
N Number of words per candidate solution
T Trade-off parameter


1 Introduction 

Word segmentation is the process of determining the
positions of the spaces in a sequence of words written
without spaces. It may also mean determining the
morphemes of a word, such as segmenting the word
"unreachable" into the morphemes "un, reach, able". A
human can segment words easily because of accumulated
knowledge. However, this process is complicated for a
machine because it does not have enough knowledge to
deal with the ambiguity of human languages.

Word segmentation methods depend either on a raw list of
words or on a language model. Methods that depend on a
raw list of words are called dictionary-based methods
[6, 35, 26]. They use heuristic functions that search
locally for the words. Because they consider only local
candidate solutions, dictionary-based methods usually
have low accuracy and low complexity. Generally,
dictionary-based methods are inefficient for segmenting
most languages [17]. On the other hand, methods that use
language models are called statistical-based methods
[33, 36]. The latter methods are more accurate but
highly complex. Norvig [25] tried to solve the
complexity problem by using a local search function that
excludes all words longer than twenty characters. It is
worth noting that the accuracy of statistical methods
depends on the size of the corpus from which the
language model is constructed.

Hybrid methods can be constructed by integrating
dictionary-based and statistical-based methods to
improve both complexity and accuracy. Our previous work
[23] proposed a hybrid approach that uses a local search
technique for word segmentation. In this approach, the
segmentation is built through an iterative process that
separates the first word of the local candidate of N
words with the best score, where N is a fixed number
estimated heuristically. We also used a fitness function
that depends on both the probability and the length of
the words to increase the accuracy.

Although hybrid approaches [17, 38, 16, 29] tried to
enhance the performance of the segmentation in terms of
accuracy and complexity, two main problems still face
the word segmentation process. The first problem is
determining the size of the local search space, which
affects both the accuracy and the complexity. The second
problem is the ambiguity caused by a possibly unfair
probability distribution in the language model when a
relatively small corpus is used. In this case, pure
statistical methods are not efficient. Teahan [37] and
Hockenmaier [14] indicated that the corpus size greatly
affects the performance of the segmentation system.

A new approach for word segmentation that handles the
previous two problems is therefore highly desirable. In
particular, estimating the parameters that affect the
two problems will influence the accuracy of the
segmentation process. The first parameter is the length
N of the local candidate segmentation. The second
parameter is a trade-off weight between the length and
the probability of the word. The latter parameter can
help in solving the unfair probability distribution
problem. To estimate these parameters, we use a
genetic-based approach for parameter optimization.

To this end, the main contribution of this paper is a
new approach for word segmentation that contains a
preprocessing step using the genetic algorithm to
optimize the segmentation parameters. The paper does so
by extending and enhancing our previous work [23]
through estimating the two parameters mentioned above.
Additionally, the approach is tested in several
experiments, and the results before and after parameter
optimization are compared to measure the performance.
The experiments run on both the Google n-gram and BNC
language models. Several data sets are used in the
experiments. Furthermore, the proposed approach is
tested on the English and Arabic languages to show that
it is language independent. Also, the experimental
results on the relatively small Arabic data set show the
ability of the proposed approach to deal with small
corpora.

The rest of this paper is organized as follows. Section
2 discusses some basic terminology and background.
Section 3 discusses the related work. Section 4 presents
the proposed algorithm and the optimization method.
Section 5 evaluates the proposed approach on the English
and Arabic languages. Finally, Section 6 concludes the
paper.

2 Background 

This section introduces some terminologies on the 
statistical-based word segmentation methods [25] as 
well as performance measurements [17, 23]. Addition- 
ally, the section gives background on the genetic algo- 
rithms [12]. 

2.1 Word Segmentation 

Computerized word segmentation can be used in different
application domains. For example, word segmentation is
an essential process in optical character recognition
[5], where any incorrect segmentation of the scanned
documents leads to errors in the information retrieval
of the document. In NLP tasks, in particular speech
recognition [30], word segmentation can also be used to
identify the pauses in the speech [3]. Moreover, word
segmentation can be used to identify words in languages
that are written without spaces (e.g., Japanese,
Chinese, and Thai). In such languages, the words are not
delimited by whitespace but must be inferred from the
character sequence [10]. Word segmentation can
additionally be used in database schema matching to
resolve semantic heterogeneity [21].

Broadly speaking, one of the most important aspects of a
word segmentation system is the data captured from
humans in the form of training data. Since the training
data is huge, it cannot be used directly in the
segmentation process; it should first be preprocessed
and summarized [25]. The summary contains the words and
their frequencies, which gives the so-called unigram.
Additionally, the summary might contain each pair of
consecutive words and their frequencies to form the
so-called bigram. This process can be generalized
iteratively to form the so-called n-gram.

After extracting the unigram, the entries can be
converted to a language model where the frequencies are
replaced by the probabilities of the words. The
probability P of a word w in the corpus is given as
$P(w) = \frac{count(w)}{count(words)}$, where
count(words) is the count of all words in the corpus.
Generally, in the n-gram, the frequency is replaced by
the conditional probability $P(w_n \mid w_{n-1})$ of
$w_n$ given a previous word $w_{n-1}$, where
$P(w_n \mid w_{n-1}) = \frac{count(w_{n-1} w_n)}{count(w_{n-1})}$.
The probability of a sequence of n words
$w_1, w_2, \ldots, w_n$ is given as
$P(w_1 w_2 \cdots w_n) = \prod_{i=1}^{n} P(w_i \mid w_{i-1})$,
where $w_0$ denotes the start of the sequence. The
n-gram language model is the model in which the
probability of a word depends only on the previous
n - 1 words. Any unseen word w that is not contained in
the language model is assigned a small probability that
is inversely proportional to count(words) [25].
Sometimes a zero frequency occurs in the n-gram model.
To cope with this problem, a back-off process is used:
if there is no n-gram for a certain word sequence, the
model looks for an (n - 1)-gram instead.
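The estimates above can be illustrated with a minimal sketch. It assumes a bigram model built from a tokenized training text and a one-level back-off to the unigram estimate; the function names and the toy sentence are hypothetical and are not taken from [25] or from the evaluated system.

```python
from collections import Counter

def build_model(tokens):
    """Count unigrams and bigrams from a list of training tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def p_unigram(word, unigrams):
    """P(w) = count(w) / count(words); unseen words get 0.0 (no smoothing here)."""
    total = sum(unigrams.values())
    return unigrams[word] / total

def p_bigram(prev, word, unigrams, bigrams):
    """P(word | prev), backing off to the unigram estimate for unseen bigrams."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    return p_unigram(word, unigrams)  # back off from bigram to unigram

# Toy training text (illustrative only).
tokens = "the cat sat on the mat".split()
unigrams, bigrams = build_model(tokens)
```

A real system would additionally smooth the probability of unseen words, which this sketch leaves at zero.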

Generally, the performance of a word segmentation system
is measured in terms of precision (p), recall (r) and
F-measure (f) [17, 28]. Informally, precision (p) is the
percentage of correctly segmented words among the output
words of the system. Recall (r) is the percentage of
correctly segmented words relative to the number of
words in the original text. The F-measure is the
harmonic mean of precision (p) and recall (r). More
formally, let $N_1$ be the true number of word
boundaries in some original text, $N_2$ the number of
spaces between words in the output of a system, and
$N_3$ the number of correctly segmented output words of
the system. Then the precision is $p = \frac{N_3}{N_2}$,
the recall is $r = \frac{N_3}{N_1}$, and the F-measure
is

$$f = \frac{2 \cdot precision \cdot recall}{precision + recall}$$
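As an illustration of these measures, the following sketch counts an output word as correct when it covers exactly the same character span as a word of the reference segmentation. This matching rule and the function names are assumptions made for the example, not necessarily the exact counting used in the experiments.

```python
def spans(words):
    """Map a word list to the set of (start, end) character spans it covers."""
    result, start = set(), 0
    for w in words:
        result.add((start, start + len(w)))
        start += len(w)
    return result

def evaluate(reference, output):
    """Return (precision, recall, f_measure) for one segmented text."""
    ref, out = spans(reference), spans(output)
    correct = len(ref & out)          # words with identical boundaries
    precision = correct / len(out)
    recall = correct / len(ref)
    if correct == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# "nowhereis" segmented as two words although the reference has three.
p, r, f = evaluate(["now", "here", "is"], ["nowhere", "is"])
```

Here only "is" has the correct boundaries, so precision is one out of two output words and recall is one out of three reference words.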

2.2 Genetic Algorithms 

Optimization is the process of making something (such as
a design, system, or decision) as perfect, functional,
or effective as possible. Usually, the optimization
process includes an objective function with related
parameters. Several optimization methods exist. The
Genetic Algorithm (GA) is considered one of the
successful methods for handling optimization problems
[34, 22]. GA has a strong ability to discover global
optima and to solve multi-objective optimization
problems. Additionally, GA can deal well with noisy
functions and can handle large and complex search
spaces. GA was first developed in [15]. GAs are used for
optimization in different fields. In [32], GA is used
for the optimization of fingerprint recognition. In
[39], GA optimizes the parameters of an SVR system to
forecast the volume of sales. The work in [13] used GA
for the least-cost design of water distribution
networks.

The idea of GA is inspired by the theory of the
evolution of species. In this theory, weak species are
subject to extinction by natural selection. Strong
species, on the other hand, have a high chance of
passing their genes to future generations by means of
reproduction. Species holding a fitting combination of
genes become dominant in the population in the long run.
Also, changes in genes might happen during the evolution
process. If these changes provide additional advantages
for the species to survive, new species evolve from the
old ones.

In GA terminology, a solution vector
$x = [x_1, x_2, \ldots, x_N]$, where $x \in X^N$, of N
features is called a chromosome or an individual. A
chromosome consists of N units called genes, where each
gene governs one or more features of the chromosome.
Typically, a chromosome represents a single solution x
in the solution space. Each gene represents an encoded
value which is used in the evaluation process [4]. If
the solution values are stored in encoded form in the
gene, they are called the genotype; otherwise, they are
called the phenotype [34]. Once all chromosomes of the
current generation are evaluated, the selection process
takes place. In this process, the best-fitted
chromosomes have a better chance of being selected as
parents for crossover and mutation [34]. Crossover is
the process of generating a new chromosome that has some
characteristics of both parents, while mutation means
changing one or more of the features of the chromosome.

Generally, GA is equipped with parameters. These
parameters can be classified into two types [2]. The
first type is the structural parameters that affect the
structure of GA applications. Examples of this type
include the stopping criterion, the encoding scheme and
the selection method. The second type is the numerical
parameters. This type includes the number of
generations, the number of chromosomes per generation,
the mutation probability and the crossover probability.
There are two main approaches for parameter selection.
The first is the empirical method, which depends on
sensitivity analysis. The second is the adaptive method,
which uses an initial parameter setting that is
optimized while the algorithm is executed [27].


3 Related Work 

This section discusses related work on word
segmentation. It begins with the traditional approaches
for word segmentation, followed by the related work
that uses genetic algorithms.

3.1 Traditional Approaches 

Word segmentation methods can be classified into three
categories, namely dictionary-based, statistical-based
and hybrid methods. Dictionary-based methods generally
depend on a so-called matching algorithm and a list of
words. Examples of methods belonging to this category
are Maximum Matching (MM), greedy matching and reverse
MM [6]. The second category is the statistical-based
methods, which depend on word statistics. The statistics
of words are computed using the frequency of the words
in some corpus. The work in [25], for example, addressed
the global search problem by using a local search
algorithm. It proposed a method that relies on the
independence assumption and a maximum word length. The
work in [31] is another example of the statistical-based
methods; it depends on the frequency of syllables
instead of word frequencies. Finally, the last category
is the hybrid methods, in which dictionary-based and
statistical-based methods are combined in the same
framework [23, 20, 40, 29, 17]. Our previous work [23]
is an example of this category, in which we used a
matching algorithm that matches three consecutive words.
Additionally, we used a score function that depends on
both the probability and the length of the three
consecutive words.

In contrast to the work of this paper, the works in the
previous three categories do not consider the
optimization of the segmentation parameters.

3.2 Genetic Approaches

Introducing GA to word segmentation is not new. The work
of [24] used GA to find an approximately globally
optimal segmentation from a set of possible
segmentations for Vietnamese text. The experiments in
that work showed that GA can give considerable accuracy
with less complexity. In contrast to the work of this
paper, however, that method is limited to Vietnamese
texts, as it depends on the multi-syllable property of
Vietnamese words.

Several works used GA for morpheme segmentation. For
example, the work in [18] used GA in prefix-suffix word
segmentation to segment words into morphemes. There, GA
was used to extract word segmentation rules from a list
of words. It is noteworthy that the performance of the
word segmentation process depends on the quality of this
list of rules. Another work on morpheme segmentation
using GA is presented in [11]. It provided an
unsupervised technique for morpheme segmentation in
Spanish as an inflective language. Unlike the previous
work, it did not use GA to extract rules; instead, it
used GA to find the morpheme segmentation with the
highest fitness. This work is also limited to the
Spanish language. A further work on morpheme
segmentation using GA is [8]. It used a general-purpose
GA with a specifically designed evaluation function for
morpheme segmentation. In particular, it proposed a
suffix-based evaluation function depending on a list of
correctly segmented words.

Unlike the work of this paper, the previous GA works are
limited to morpheme segmentation and cannot be applied
to segmenting a sequence of characters without spaces
into meaningful words. In addition, these works are
language dependent.


4 Proposed Work 

This section proposes a method that enhances our
previous work [23]. The enhancement replaces the
manually tuned parameters with GA-optimized parameters.
As mentioned previously, there are two main parameters
that should be optimized before the algorithm starts.
The first is the number of words per candidate solution,
which we refer to as N. The second is the trade-off
weight between the effect of the length and the effect
of the probability on the fitness function, which we
refer to as T. Figure 1 shows the workflow of the
proposed method. First, the GA uses the dataset to
optimize the parameters. Once the optimized parameter
values are calculated, they are fed to the word
segmentation algorithm.



Figure 1: Simple system architecture for the parameter
optimization process using genetic algorithms

4.1 The Word Segmentation Algorithm

Because word probability and word length are important
in both statistical-based and dictionary-based
approaches, the majority of segmentation methods use the
probability and the word length in the fitness function.
Figure 2 describes the proposed segmentation algorithm.
The main functions that construct the segmentation
algorithm are Segment(desegmented),
FirstOfBestCandidateSolution(inputString),
CandidateSolution(inputString), Match(inputString), and
Pw(previousWord, word). The role of the function
Segment(desegmented) is to determine the boundaries of
words in the input sequence of characters without
spaces. It repeats lines 3, 4 and 5. Line 3 finds the
best local candidate solution and separates its first
word. Line 4 adds this word to the list of segmented
words. Line 5 removes the letters of this word from the
beginning of the unsegmented input. By repeating these
three steps, the input string is converted into a list
of segmented words. The next few paragraphs illustrate
these steps.

The function FirstOfBestCandidateSolution(inputString)
returns the first word of the candidate solution that
achieves the highest fitness. A candidate solution is a
sequence of N consecutive words that can be matched with
the beginning of the input string. It is assumed that
the number of words per candidate solution is set before
the algorithm starts. As stated previously, GA is used
to optimize this parameter.

The purpose of the function CandidateSolution is to find
N consecutive words that match the beginning of the
input string, together with their fitness. To match a
single word, the function Match(inputString) is used.
This function returns a single word from the dictionary
that matches the beginning of the input string. Once a
candidate solution is found, its fitness is calculated,
and a pair (ListOfCandidateWords, Score) is returned.
This Score is calculated based on the probability and
length of the words, as given in Equation (1). The last
function, Pw(previousWord, word), is used to find the
probability of a word.

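$$Score = \frac{A}{B} \qquad (1)$$

where

$$A = \log P(word_1 \mid \langle S \rangle) + \sum_{i=1}^{N-1} \log P(word_{i+1} \mid word_i)$$

and

$$B = \Big( \sum_{i=1}^{N} length(word_i) \Big)^{T}$$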
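To make the role of the trade-off parameter T concrete, the following sketch evaluates a score of this form for two hypothetical candidates. It assumes that the score is the summed log-probability of the candidate divided by its total character length raised to the power T; the probabilities and lengths are invented for illustration.

```python
import math

def score(log_probs, lengths, t):
    """Summed log-probability divided by (total length) ** t."""
    return sum(log_probs) / (sum(lengths) ** t)

# Two hypothetical two-word candidates for the same input prefix:
# one made of longer but rarer words, one of shorter but more frequent words.
long_words = {"log_probs": [math.log(0.001), math.log(0.1)], "lengths": [7, 2]}
short_words = {"log_probs": [math.log(0.01), math.log(0.05)], "lengths": [3, 4]}

def best(t):
    """Return the name of the candidate with the higher score for a given T."""
    named = {"long_words": long_words, "short_words": short_words}
    return max(
        named,
        key=lambda k: score(named[k]["log_probs"], named[k]["lengths"], t),
    )
```

With T = 0 the length has no effect and the more probable candidate (`short_words`) wins, while with T = 1 the longer candidate (`long_words`) obtains the higher score. This is the sense in which T trades probability against length.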

It should be mentioned that the proposed algorithm does
not treat the problem of unseen words. This means that
an unseen word will be wrongly segmented into seen
words. From our perspective, the statistical methods
suffer from the same problem: in most cases, they also
segment an unseen word into seen words. An unseen
character or symbol is segmented as a word of its own.

4.2 Parameter Optimization 

This section explains how the GA is used to optimize the
two main parameters N and T of the segmentation
algorithm. The optimization starts by splitting the data
set into three sets. The first is the training set,
which is summarized into the language model. The second
is the development set, which is used to compute the
parameter values. The third is the test set, which is
used to test the accuracy of the segmentation algorithm
after parameter optimization.

Once the language model has been extracted from the
training set in the first step, the GA starts the
optimization process. An initial generation of
chromosomes is created with random values. After that,
the breeding process starts. This process consists of
three steps, namely selection, reproduction and
replacement [34]. The selection step starts with the
evaluation of the chromosomes of the entire generation.
To evaluate the fitness of a single chromosome, the
genotype values of its genes are converted into
phenotypes, which are then used as parameters for the
segmentation algorithm. The F-measure is then used as
the fitness of each chromosome. Next, random selection
with a probability depending on the fitness takes place.
In the reproduction step, which involves crossover and
mutation, new chromosomes having some characteristics of
their parents are created. The final step of breeding is
the replacement step, in which the old chromosomes are
discarded and replaced by the new generation of
chromosomes. In this work, uniform crossover is applied,
and mutation is performed by adding or subtracting
random values. The breeding process is repeated until
all generations are processed.

At the end of the GA process, the optimized parameter
values are passed to the segmentation algorithm to test
its performance on a given data set. Figure 3
illustrates the previous steps.
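The breeding loop described above can be sketched as follows. This is a minimal illustration under several assumptions: a chromosome is simply the pair (N, T), the value ranges and mutation step sizes are invented, and the fitness is a toy function standing in for the F-measure obtained by running the segmentation algorithm on the development set.

```python
import random

def roulette(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]

def uniform_crossover(a, b, rng):
    """Copy each gene from either parent with equal probability."""
    return tuple(x if rng.random() < 0.5 else y for x, y in zip(a, b))

def mutate(chrom, rng):
    """Add or subtract a small random value to one gene (N stays an integer >= 1)."""
    n, t = chrom
    if rng.random() < 0.5:
        n = max(1, n + rng.choice([-1, 1]))
    else:
        t = max(0.0, t + rng.uniform(-0.2, 0.2))
    return (n, t)

def optimize(fitness, pop_size=20, generations=30, p_cross=0.7, p_mut=0.3, seed=0):
    """Return (best chromosome, its fitness, best fitness per generation)."""
    rng = random.Random(seed)
    population = [(rng.randint(1, 6), rng.uniform(0.0, 3.0)) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        scores = [fitness(c) for c in population]
        best_score, best_chrom = max(zip(scores, population))
        history.append(best_score)
        next_gen = [best_chrom]  # elitism: the best chromosome always survives
        while len(next_gen) < pop_size:
            a = roulette(population, scores, rng)
            b = roulette(population, scores, rng)
            child = uniform_crossover(a, b, rng) if rng.random() < p_cross else a
            if rng.random() < p_mut:
                child = mutate(child, rng)
            next_gen.append(child)
        population = next_gen  # replacement of the old generation
    scores = [fitness(c) for c in population]
    best_score, best_chrom = max(zip(scores, population))
    return best_chrom, best_score, history + [best_score]

def toy_fitness(chrom):
    """Hypothetical stand-in for the F-measure, peaked at N = 4 and T = 1.3."""
    n, t = chrom
    return 1.0 / (1.0 + (n - 4) ** 2 + (t - 1.3) ** 2)

best, best_score, history = optimize(toy_fitness)
```

Because the best chromosome of each generation is copied unchanged into the next one, the best fitness recorded in `history` never decreases. In the actual setting, `toy_fitness` would be replaced by a function that segments the development set with the given N and T and returns the resulting F-measure.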

5 Experiments 

This section examines the performance of the proposed
approach through several experiments using different
language models and different test sets in both the
English and Arabic languages. For the English language,
we used the BNC¹ and Google N-gram² language models and
a set of corpora that included Brown, Inaugural, ABC,
Shakespeare and Gutenberg as test sets³. For the Arabic
language, we used the Al-Watan dataset [1].

This section classifies the experiments into three
categories based on the language model used. In the
first category, the experiments run on the Google N-gram
language model. In the second category, the experiments
run on the BNC language model. In the last category, the
experiments run on the Al-Watan language model. Each
category starts with the GA parameter values and then
shows the performance results.
The performance is compared to the results presented
in [23]. In these experiments, the empirical approach 
is used to determine the parameters for the genetic 
algorithm]!)] . 

5.1 Google N-gram Language Model Results

This section discusses the experiments of the proposed
approach using the Google N-gram language model. The
genetic algorithm was run several times with different
GA parameter settings on a development set that contains
200,000 words. The settings that achieved the highest
F-measure were selected. These settings were as follows:
number of chromosomes per generation: 100, number of
generations: 100, probability of crossover: 0.7,
probability of mutation: 0.3 and selection strategy:
Roulette Wheel with Elitism. The best result obtained
from the genetic algorithm was 97.3% F-measure when
N = 4 and T = 1.3. Later, we used the optimized values
of N and T to test the algorithm on the test sets. Table
1 shows the results of the previous work without
optimization, while Table 2 shows the results after
performing the optimization.

¹ Available at http://www.natcorp.ox.ac.uk

² Available at http://norvig.com/ngrams/

³ Available at http://www.nltk.org/nltk_data/
 1. FUNCTION Segment(desegmented)
 2.   WHILE (desegmented is not empty)
 3.     currentWord ← FirstOfBestCandidateSolution(desegmented)
 4.     Add currentWord to the list of segmented words
 5.     Remove the letters of currentWord from the beginning of desegmented
 6.   ENDWHILE
 7.   Return the list of segmented words
 8. ENDFUNCTION

 9. FUNCTION FirstOfBestCandidateSolution(inputString)
10.   Find all possible candidate solutions that can match with inputString
11.   Result ← first word of the candidate solution that has the maximum Score
12.   Return Result
13. ENDFUNCTION

14. FUNCTION CandidateSolution(inputString)
15.   SequenceOfChars ← inputString
16.   k ← 1
17.   WHILE k <= N
18.     Wk ← Match(SequenceOfChars)
19.     Remove the letters of Wk from the beginning of SequenceOfChars
20.     Add Wk to the ListOfCandidateWords
21.     k ← k + 1
22.   ENDWHILE
23.   Calculate Score as in Equation (1)
24.   Result ← (ListOfCandidateWords, Score)
25.   Return Result
26. ENDFUNCTION

27. FUNCTION Match(inputString)
28.   IF (Length(inputString) = 0)
29.     Return <emptyString>
30.   ENDIF
31.   Result ← a word that matches the beginning of inputString
32.   Return Result
33. ENDFUNCTION

34. FUNCTION Pw(previousWord, word)
35.   IF (word = <emptyString>)
36.     Return 1
37.   ENDIF
38.   Return the probability of word according to the back-off concept
39. ENDFUNCTION

Figure 2: Proposed Method 
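The procedure of Figure 2 can be approximated by the short sketch below. It is a simplified illustration and not the evaluated implementation: it assumes a unigram probability table instead of the back-off bigram model, enumerates every candidate of up to N dictionary words by recursion, scores a candidate by its summed log-probability divided by its total length raised to T, and treats an unseen symbol as a one-character word. The vocabulary and its probabilities are invented.

```python
import math

def candidates(text, n, vocab, max_len):
    """Yield every sequence of up to n dictionary words matching the start of text."""
    if n == 0 or not text:
        yield []
        return
    found = False
    for size in range(1, min(max_len, len(text)) + 1):
        word = text[:size]
        if word in vocab:
            found = True
            for rest in candidates(text[size:], n - 1, vocab, max_len):
                yield [word] + rest
    if not found:
        yield [text[0]]  # unseen character or symbol is emitted as a word

def score(words, prob, t):
    """Summed log-probability divided by (total length) ** t (assumed score form)."""
    a = sum(math.log(prob.get(w, 1e-12)) for w in words)  # tiny floor for unseen
    b = sum(len(w) for w in words) ** t
    return a / b

def segment(text, prob, n=3, t=1.2):
    """Repeatedly split off the first word of the best-scoring local candidate."""
    max_len = max(map(len, prob))
    segmented = []
    while text:
        best = max(candidates(text, n, prob, max_len),
                   key=lambda c: score(c, prob, t))
        segmented.append(best[0])
        text = text[len(best[0]):]
    return segmented

# Hypothetical unigram probabilities.
prob = {"the": 0.3, "cat": 0.2, "is": 0.3, "here": 0.2}
words = segment("thecatishere", prob)
```

Only the first word of the best candidate is committed in each iteration, so the remaining words of that candidate are re-evaluated in the next iteration together with the text that follows them.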


By comparing the results before and after optimization,
we noticed that the F-measure is improved for all
corpora.

Table 1: Results of the algorithm without optimization
using the Google N-gram language model, from [23]

Corpus        Recall    Precision   F-measure   Size of test set
Brown         96.54%    94.63%      95.57%      1003881
Inaugural     98.71%    98.76%      98.74%      130271
ABC           97.31%    96.66%      96.98%      630004
Shakespeare   93.97%    92.43%      93.2%       184194
Gutenberg     96.91%    96.53%      96.72%      1942398


Table 2: Results of the proposed framework using the
Google N-gram language model

Corpus        Recall    Precision   F-measure   Size of test set
Brown         97.02%    95.83%      96.42%      1003881
Inaugural     98.91%    98.75%      98.82%      130271
ABC           97.42%    96.88%      97.15%      630004
Shakespeare   93.95%    92.73%      93.34%      184194
Gutenberg     97.31%    96.50%      96.9%       1942398


Figure 3: System architecture for the parameter
optimization process using genetic algorithms


5.2 BNC Language Model Results

This section discusses the experiments of the proposed
method using the BNC language model. The genetic
algorithm was run several times with different GA
parameter settings on a development set that contains
200,000 words. The settings that achieved the highest
F-measure were selected. These settings were as follows:
the number of chromosomes per generation is 100, the
number of generations is 100, the probability of
crossover is 0.7, the probability of mutation is 0.1 and
the selection strategy is Roulette Wheel with Elitism.
The best result obtained from the genetic algorithm was
97.1% F-measure when N = 3 and T = 1.2. Later, we used
the optimized values of N and T to test the algorithm on
the test sets. Table 3 shows the results on the tested
corpora without optimization, while Table 4 shows the
results after applying the optimization. Similar to the
results of Google N-gram, the F-measure is improved for
all corpora.


Table 3: Experiment results on several corpora using the
BNC language model, from previous work [23]

Corpus        Recall    Precision   F-measure   Size of test set
Brown         96.88%    96.72%      96.79%      1003881
Inaugural     97.8%     98.4%       98.1%       130271
ABC           95.96%    95.41%      95.68%      630004
Shakespeare   94.36%    93.49%      93.92%      184194
Gutenberg     95.08%    94.37%      94.72%      1942398


Table 4: Experiment results on several corpora using the
BNC language model with the current framework

Corpus        Recall    Precision   F-measure   Size of test set
Brown         96.96%    97.06%      97%         1003881
Inaugural     97.74%    98.84%      98.29%      130271
ABC           96.23%    96.61%      96.42%      630004
Shakespeare   94.43%    94.45%      94.44%      184194
Gutenberg     95.54%    95.83%      95.68%      1942398


There are other works [7, 28, 19, 17] that used the BNC
language model and the Brown corpus as a test set. We
included their results and compared them with the result
of the proposed approach. Table 5 shows this comparison.


Table 5: Comparison between the proposed work and the
works in [7, 28, 19, 17]

Method               Precision   Recall    F-measure
Kit [19]             79.33%      63.01%    70.23%
Peng [28]            74.6%       79.2%     75.49%
De Marcken [7]       90.5%       17%       28.62%
Islam [17]           89.92%      94.69%    92.24%
Proposed framework   97.06%      96.96%    97%


5.3 Al-watan Language Model 

The proposed method was tested on the Arabic language
using 10,000,000 words of the Al-Watan dataset. As
before, the dataset is split into a training set, a
development set and a test set, with proportions of 88%,
2% and 10% respectively. As mentioned previously, the
training set is used to build the language model, while
the development set is used to optimize the parameters
of the proposed algorithm, as depicted in Figure 3.
Finally, the performance of the proposed method was
measured on the test set in terms of precision, recall
and F-measure. Table 6 shows the results.

The genetic algorithm was run several times with
different GA parameter settings. The settings that
achieved the highest F-measure were selected. These
settings were as follows: the number of chromosomes per
generation is 100, the number of generations is 100, the
probability of crossover is 0.8, the probability of
mutation is 0.1 and the selection strategy is Roulette
Wheel with Elitism. The best solution in the genetic
algorithm achieved 92.3% F-measure when N = 4 and
T = 1.4.


Table 6: Comparison between the results before and after
the optimization on the Al-Watan dataset

Measure     Before optimization   After optimization
Precision   86.5%                 88.3%
Recall      91.5%                 92.1%
F-measure   89.1%                 90.2%


6 Conclusion 

Several word segmentation techniques have been proposed
to segment a sequence of concatenated characters without
spaces into meaningful words. The main task of the
segmentation process is to find the solution with the
highest accuracy. Usually, word segmentation techniques
use heuristic methods during the segmentation process,
on the one hand to avoid searching unnecessary parts of
the state space and on the other hand to choose a
measure that guides the quality of the solution. This
paper showed how to optimize segmentation parameters by
means of genetic algorithms. In particular, the genetic
algorithm has been used alongside the segmentation
process to optimize the parameter representing the
length of the local candidates generated during the
segmentation. Additionally, it has been used to optimize
the parameter that represents the trade-off between the
length of the local candidates and their probability.
The presented approach has been tested on two different
languages, namely English and Arabic. For the English
language, the proposed approach has been evaluated with
both the Google n-gram and BNC language models, whereas
for the Arabic language it was tested using the Al-Watan
language model. For each language model, several
datasets have been taken into consideration. The
proposed approach has been compared with the performance
of a successful previous work without parameter
optimization. The results showed that the genetic
algorithm is an evolutionary approach that can be
successfully applied to word segmentation.